Putting Visual Analysis into Practical Use
In this take-home exercise, static and interactive visualisation methods are used to explore the health of employers, based on data gathered from participants in Ohio, USA.
Two main questions are explored:
The health of employers: the distribution of jobs and their education requirements.
The prosperity of businesses: the number of visitors throughout the study period.
The following code chunk installs (if necessary) and loads the packages we need.
packages <- c('ggiraph', 'plotly',
              'DT', 'patchwork',
              'gganimate', 'tidyverse',
              'readxl', 'gifski', 'gapminder',
              'treemap', 'treemapify',
              'rPackedBar', 'lubridate')
for (p in packages) {
  # Install the package only if it is not already available
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
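Before running the loop, it can be useful to see which packages would actually be installed. The following is a minimal sketch in base R; the helper name `missing_packages` is our own and is not part of any package.

```r
# Hypothetical helper: return the names of packages that cannot be loaded.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE/FALSE without attaching the package
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# 'stats' and 'utils' ship with R, so nothing should be reported as missing.
missing_packages(c("stats", "utils"))
```

Calling `missing_packages(packages)` with the vector defined above would list only the packages that the loop still has to install.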
jobs <- read_csv("data/Jobs.csv")
employers <- read_csv("data/Employers.csv")
CheckinJournal <- read_csv("data/CheckinJournal.csv")
The analysis is carried out at the employer level: the distribution of employers on the map, which employers offer more jobs, which employers require a higher education level, which employers pay higher salaries, and so on.
First, we join the employers and jobs tables on the key “employerId”.
joined_table <- jobs %>%
left_join(y=employers, by = c("employerId" = "employerId"))
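# Jobs paying less than 11 per hour (the large group referred to as hourly rate = 10)
joined_table_only10 <- joined_table %>%
  filter(hourlyRate < 11)
# Jobs paying 11 per hour or more
joined_table_without10 <- joined_table %>%
  filter(hourlyRate >= 11)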
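To make the join and the split at 11 concrete, here is a small sketch on made-up data. It uses base R `merge()` with `all.x = TRUE` as a stand-in for `left_join()` so that it runs without extra packages; the toy tables and their values are assumptions for illustration only, not rows from the real files.

```r
# Toy data; the values are invented for illustration.
jobs_toy <- data.frame(
  jobId      = 1:4,
  employerId = c(101, 101, 102, 103),
  hourlyRate = c(10, 18.5, 10, 42)
)
employers_toy <- data.frame(
  employerId = c(101, 102, 103),
  buildingId = c(7, 8, 9)
)

# Left join: keep every job and attach the columns of its employer.
joined_toy <- merge(jobs_toy, employers_toy, by = "employerId", all.x = TRUE)

# Split at the same threshold of 11 used for the real data.
base_rate_jobs   <- joined_toy[joined_toy$hourlyRate < 11, ]
higher_rate_jobs <- joined_toy[joined_toy$hourlyRate >= 11, ]

# Every job ends up in exactly one of the two subsets.
nrow(base_rate_jobs) + nrow(higher_rate_jobs) == nrow(jobs_toy)
```

Because the join keeps one row per job, the two subsets together always contain as many rows as the jobs table, provided no hourly rate is missing.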
The bar chart shows the distribution of jobs across the different education requirements.
ggplot(data=joined_table,
aes(x=educationRequirement)) +
geom_bar() +
ylab("Jobs") +
theme(axis.title.y=element_text(angle = 0))
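The bar heights are simply counts per category, so they can be checked numerically. The sketch below uses a made-up vector; the four level names and their order are an assumption about how `educationRequirement` is coded and should be verified against the data.

```r
# Toy vector of education requirements (invented values).
education_toy <- c("Bachelors", "Low", "HighSchoolOrCollege", "Low",
                   "Graduate", "HighSchoolOrCollege", "HighSchoolOrCollege")

# Assumed ordering from lowest to highest requirement.
levels_assumed <- c("Low", "HighSchoolOrCollege", "Bachelors", "Graduate")
education_factor <- factor(education_toy, levels = levels_assumed)

# One count per level, in the order given by the factor levels.
job_counts <- table(education_factor)
job_counts
```

Converting the column to a factor with explicit levels in this way would also make ggplot2 draw the bars in that order instead of alphabetically.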
Jobs with an hourly rate of 10 are so numerous that they would dominate the following graphs. The next two graphs therefore leave these jobs out; they are examined separately afterwards.
tooltip_css <- "background-color:white;
font-weight:bold; color:black;"
# One tooltip string per job: building ID and employer ID
joined_table_without10$tooltip <- paste0(
  "Building ID = ", joined_table_without10$buildingId,
  "\n Employer ID = ", joined_table_without10$employerId)
p <- ggplot(data = joined_table_without10,
            aes(x = hourlyRate)) +
  geom_dotplot_interactive(
    aes(tooltip = tooltip),   # refer to the column by name inside aes()
    stackgroups = TRUE,
    binwidth = 0.8,
    method = "histodot") +
  coord_cartesian(xlim = c(10, 100)) +
  scale_y_continuous(NULL,
                     breaks = NULL)
girafe(
ggobj = p,
width_svg = 6,
height_svg = 4,
options = list(
opts_tooltip(
css = tooltip_css))
)
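The tooltip text relies on `paste0()` being vectorised: it builds one string per row, which is why the result can be stored as a column. A minimal sketch on made-up IDs (the values are assumptions, not taken from the data):

```r
# Toy employers with invented IDs.
employers_toy <- data.frame(buildingId = c(7, 8), employerId = c(101, 102))

# paste0() recycles the fixed labels and returns one string per row.
employers_toy$tooltip <- paste0(
  "Building ID = ", employers_toy$buildingId,
  "\n Employer ID = ", employers_toy$employerId
)

# cat() interprets the "\n", showing the two-line layout of the tooltip.
cat(employers_toy$tooltip[1], "\n")
```

The same pattern extends to further fields, for example by appending the hourly rate to each tooltip.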
p <- ggplot(data=joined_table_without10,
aes(x = hourlyRate)) +
geom_dotplot_interactive(
aes(tooltip = educationRequirement,
data_id = educationRequirement),
stackgroups = TRUE,
binwidth = 0.8,
method = "histodot") +
scale_y_continuous(NULL,
breaks = NULL)
girafe(
ggobj = p,
width_svg = 6,
height_svg = 4,
options = list(
opts_hover(css = "fill: #202020;"),
opts_hover_inv(css = "opacity:0.2;")
)
)
ggplot(data=joined_table_only10,
aes(x=educationRequirement)) +
geom_bar() +
ylab("Jobs") +
theme(axis.title.y=element_text(angle = 0))
First, we extract the Pub and Restaurant check-ins from CheckinJournal.csv.
Then, we derive the day number, counted from the start date of the dataset (2022-3-1), and the month (1st-15th).
Finally, we pivot Pub and Restaurant into columns and calculate the number of check-ins at pubs and at restaurants, as well as their total (Population).
# Keep only check-ins at pubs and restaurants
CheckinJournal_buz <- CheckinJournal %>%
  filter(venueType %in% c("Pub", "Restaurant"))
# Day number used for grouping; yday() returns the day of the year (1-366)
CheckinJournal_buz$yday <- yday(CheckinJournal_buz$timestamp - 59)
CheckinJournal_buz <- CheckinJournal_buz %>%
  select(-timestamp, -venueId)
# Count check-ins per day and venue type
CheckinJournal_buz_gb <- CheckinJournal_buz %>%
  group_by(yday, venueType) %>%
  summarise(population = n())
# Spread the venue types into columns and replace missing counts with 0
CheckinJournal_buz_gb_1 <- CheckinJournal_buz_gb %>%
  mutate(i = row_number()) %>%
  spread(venueType, population) %>%
  select(-i)
CheckinJournal_buz_gb_1[is.na(CheckinJournal_buz_gb_1)] <- 0
# One row per day with pub, restaurant and total counts
CheckinJournal_buz_gb_final <- CheckinJournal_buz_gb_1 %>%
  group_by(yday) %>%
  summarise(Pub = sum(Pub), Restaurant = sum(Restaurant)) %>%
  mutate(Population = Pub + Restaurant)
# Bin the day number into months 1st-15th
CheckinJournal_buz_gb_final$month <- cut(CheckinJournal_buz_gb_final$yday,
  breaks = c(0, 31, 61, 92, 123, 154, 184, 215, 245, 276, 307, 335, 366, 396, 421, 500),
  labels = c('1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th', '11th', '12th', '13th', '14th', '15th'))
head(CheckinJournal_buz_gb_final)
# A tibble: 6 x 5
   yday   Pub Restaurant Population month
  <dbl> <int>      <int>      <int> <fct>
1     1  1214       1032       2246 1st
2     2   609        982       1591 1st
3     3   361        963       1324 1st
4     4   393        983       1376 1st
5     5   496        957       1453 1st
6     6   476        960       1436 1st
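The step that turns one row per check-in into one column per venue type can be illustrated on a few made-up rows. This is only a sketch in base R, using `table()` instead of the tidyverse pipeline, so that it runs without extra packages; the toy check-ins are assumptions for illustration.

```r
# Toy check-ins: one row per visit (invented values).
checkins_toy <- data.frame(
  day       = c(1, 1, 1, 2, 2, 3),
  venueType = c("Pub", "Restaurant", "Pub", "Restaurant", "Restaurant", "Pub")
)

# Cross-tabulate day by venue type; combinations with no visits become 0.
counts <- table(checkins_toy$day, checkins_toy$venueType)

# Rebuild a plain data frame with one column per venue type.
wide <- data.frame(
  day        = as.numeric(rownames(counts)),
  Pub        = as.vector(counts[, "Pub"]),
  Restaurant = as.vector(counts[, "Restaurant"])
)
wide$Population <- wide$Pub + wide$Restaurant
wide
```

A cross-tabulation fills absent day and venue combinations with 0 automatically, which is the role played by the explicit NA replacement in the pipeline.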
Note that the number of days (yday) is counted from 2022-3-1, so month "1st" corresponds to March 2022.
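As a cross-check on this numbering, the day and month indices can also be computed directly from calendar dates. The sketch below is one possible way to do it, assuming the timestamps can be converted with `as.Date()` and that 2022-03-01 is day 1; the example dates are made up. Unlike `yday()`, which restarts at 1 every calendar year, a date difference keeps counting past day 366.

```r
# Assumed start of the study period, taken from the note above.
start_date <- as.Date("2022-03-01")

# Toy dates (invented) covering both calendar years.
dates_toy <- as.Date(c("2022-03-01", "2022-03-31", "2022-04-01",
                       "2023-03-01", "2023-05-15"))

# Days elapsed since the start date, with the start date itself as day 1.
day_index <- as.integer(dates_toy - start_date) + 1L

# Calendar month counted from March 2022 = 1.
year_num    <- as.integer(format(dates_toy, "%Y"))
month_num   <- as.integer(format(dates_toy, "%m"))
month_index <- (year_num - 2022L) * 12L + (month_num - 3L) + 1L

data.frame(date = dates_toy, day_index, month_index)
```

Deriving the month from the calendar date avoids hand-computed break points, which are easy to get off by a day when months have different lengths.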